KTEXT User's Guide
Evan L. Antworth
Summer Institute of Linguistics
evan@txsil.lonestar.org
May 6, 1991
KTEXT version 0.9.4
1 Overview of KTEXT
1.1 What does KTEXT do?
1.2 Placing KTEXT in its context
1.3 Technical specifications
1.4 Program status
2 Example of using KTEXT to process a text
3 Running KTEXT
4 KTEXT's functional structure
5 The text data file
6 The main control file
7 The TXTIN control file
7.1 Text orthography changes
7.2 Words or format markers?
7.3 Selecting fields
7.4 Special output characters
7.5 Controlling capitalization
7.6 A sample text input control file
8 The output data file
9 CED: an editor for failures and ambiguities
9.1 Overview of CED
9.2 Starting the CED editor
9.3 Editing for text glossing
9.4 The editing process
9.5 Command summary
Notes
References
1 Overview of KTEXT
This section briefly describes what KTEXT does, places KTEXT in its
computational context, lists technical specifications of the program,
and gives information on use and support of the program.
1.1 What does KTEXT do?
KTEXT is a text processing program that uses the PC-KIMMO parser (see
below about PC-KIMMO). KTEXT reads a text from a disk file, parses
each word, and writes the results to a new disk file. This new file is
in the form of a structured text file where each word of the original
text is represented as a database record composed of several fields.
Each word record contains a field for the original word, a field for
the underlying or lexical form of the word, and a field for the gloss
string. For example, if the text in the input file contains the word
hoping (to use an English example), KTEXT's output file will have a
record of this format:
\a V(hope)+PROG
\d hope+ing
\w hoping
This record consists of three fields, each tagged with a backslash
code.[1] The first field, tagged with \a for analysis, contains the
gloss string for the word. The second field, tagged with \d for
(morpheme) decomposition, contains the underlying or lexical form of
the word. And the third field, tagged with \w for word, contains the
original word. The word spies demonstrates how KTEXT handles multiple
parses:
\a %2%N(spy)+PLURAL%V(spy)+3SG%
\d %2%spy+s%spy+s%
\w spies
Percent signs (or some other designated character) separate the
multiple results in the \a and \d fields, with a number indicating how
many results were found.
A word record also saves any capitalization or punctuation associated
with the original word. For example, if a sentence begins "Obviously,
this hypothesis.", KTEXT will output the first word like this:
\a ADJ(obvious)+ADVR
\d obvious+ly
\w obviously
\c 1
\n ,
The \w field contains the original word without capitalization or the
following comma. The \c field contains the number 1 which indicates
that the first letter of the original word is upper case. The \n field
contains the comma that follows the original word. The purpose of
retaining the capitalization and punctuation of the original text is,
of course, to enable one to recover the original text from KTEXT's
output file.
The output of KTEXT is not intended to be an end in itself. While
there may be some usefulness in directly examining the data structures
produced by KTEXT, the intention is to use KTEXT's output as the basis
of further data processing. A number of applications could use the
kind of morphologically parsed text that KTEXT produces, including
syntactic parsers, concordance programs, and machine translation
programs.
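As a concrete illustration of the kind of further processing intended, here is a minimal sketch in Python (it is not part of KTEXT or PC-KIMMO) that reads an output file such as alice.ana into simple records and splits fields that carry the ambiguity marker. The file name, the Latin-1 encoding, and the assumption that every word record begins with an \a field are only illustrative, and multi-line white space (\n) fields are ignored for simplicity.

def split_ambiguous(value, marker="%"):
    # Split a field like '%2%spy+s%spy+s%' into its alternatives.
    if value.startswith(marker):
        parts = value.strip(marker).split(marker)
        return parts[1:]          # drop the leading count
    return [value]

def read_records(path):
    records, current = [], None
    with open(path, encoding="latin-1") as f:   # encoding is a guess
        for line in f:
            if not line.startswith("\\"):
                continue
            code, _, value = line.rstrip("\n").partition(" ")
            if code == "\\a":                   # assume \a starts a record
                current = {}
                records.append(current)
            if current is not None:
                current.setdefault(code, []).append(value)
    return records

# Example: list every word that received more than one parse.
for rec in read_records("alice.ana"):
    analyses = split_ambiguous(rec["\\a"][0])
    if len(analyses) > 1:
        print(rec.get("\\w", ["?"])[0], analyses)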
1.2 Placing KTEXT in its context
KTEXT is best understood in relation to two other programs: PC-KIMMO
and AMPLE. First, consider PC-KIMMO. KTEXT is
intended to be used with PC-KIMMO (though it is a stand-alone
program). PC-KIMMO is a program for doing computational phonology and
morphology. It is typically used to build morphological parsers for
natural language processing systems. PC-KIMMO is described in the book
"PC-KIMMO: a two-level processor for morphological analysis" by Evan
L. Antworth, published by the Summer Institute of Linguistics (1990).
The PC-KIMMO software is available for MS-DOS (IBM PCs and
compatibles), Macintosh, and UNIX. The book (including software) is
available for $23.00 (plus postage) from:
International Academic Bookstore
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
phone 214/709-2404
fax 214/709-2433
The KTEXT program which this document describes will be of very little
use to you without the PC-KIMMO program and book. The remainder of
this document assumes that you are familiar with PC-KIMMO.
PC-KIMMO was deliberately designed to be reusable. The core of
PC-KIMMO is a library of functions such as load rules, load lexicon,
generate, and recognize. The PC-KIMMO program supplied on the release
diskette is just a user shell built around these basic functions. This
shell provides an environment for developing and testing sets of rules
and lexicons. Since the shell is a development environment, it has very
little built-in data processing capability. But because PC-KIMMO is
modular and portable, you can write your own data processing program
that uses PC-KIMMO's function library. KTEXT is an example of how to
use PC-KIMMO to create a new natural language processing program.
KTEXT is a text processing program that uses PC-KIMMO to do
morphological parsing.
KTEXT is also closely related to a program called AMPLE (Weber et al.
1988), which is also a morphological parser designed to process text.
KTEXT was created by replacing AMPLE's parsing engine with the
PC-KIMMO parser. Thus KTEXT has the same text-handling mechanisms as
AMPLE and produces output similar or even identical to AMPLE's. The
advantages of this design are (1) we were able to develop KTEXT very
quickly and easily since it involved very little new code, and (2)
existing programs that use AMPLE's output format can also use KTEXT's
output. The disadvantage of basing KTEXT on AMPLE is that the format
of the output file is perhaps not consistent with terminology already
established for PC-KIMMO.
1.3 Technical specifications
KTEXT runs under three operating systems:
MS-DOS (IBM PC compatibles),
UNIX System V (SCO UNIX V/386 and A/UX) and 4.2 BSD UNIX, and
Apple Macintosh.
KTEXT does not require any graphics capability. It handles eight-bit
characters (such as the IBM extended character set). It requires a
minimal amount of memory (at least 256KB on an IBM PC compatible), but
more memory is needed to load large lexicons. The Macintosh version
has the same user interface as the DOS and UNIX versions, namely a
batch-processing, command-line interface. In other words, it does not
use the Macintosh mouse, menus, and windows interface.
The program is written entirely in C and is very portable. The
Macintosh version was compiled with the Lightspeed Think C compiler.
1.4 Program status
KTEXT was developed by Steven McConnel and Evan Antworth of the Summer
Institute of Linguistics. KTEXT version 0.9 is a beta test version.
Its features are subject to change. Several qualifications apply to
its use and support:
(1) This software, source code and executable program, is copyrighted
by the Summer Institute of Linguistics. You may use this software at
no cost for whatever purpose you see fit. You are granted the right to
distribute this software to others, provided that all files are
included in unmodified form and that you charge no fee (except cost of
media). This software is intended for academic use only, and may not
be distributed or used for commercial profit without express
permission of the Summer Institute of Linguistics.
(2) This software represents work in progress and bears no warranty,
either expressed or implied, of its fitness for any particular
purpose.
(3) In releasing this software, the Summer Institute of Linguistics
is making no commitment to maintain it. It is, however, committed to
forwarding user feedback to the software's authors who may or may not
choose to develop the software further.
Bug reports, wish lists, requests for support, and positive feedback
should be directed to Evan Antworth at this address:
Evan Antworth
Academic Computing Department
Summer Institute of Linguistics
7500 W. Camp Wisdom Road
Dallas, TX 75236
phone: 214/709-2418
e-mail: evan@txsil.lonestar.org
2 Example of using KTEXT to process a text
Typically, the steps involved in using KTEXT are:
(1) Collect a corpus of language data suitable for phonological and
morphological analysis (typically paradigms of words).
(2) Do phonological and morphological analysis on the data.
(3) Use the PC-KIMMO shell to develop a rules file and a lexicon file
that encode your phonological and morphological analyses and to test
them against your corpus of data.
(4) Select a text and keyboard it.
(5) Set up the control files required by KTEXT.
(6) Using the rules and lexicon you developed, process the text with
KTEXT.
(7) Edit KTEXT's output file to remove multiple parses.
(8) Use the edited file as input to some other program.
To demonstrate how to use KTEXT to process a text, we will use a
folktale text taken from Leonard Bloomfield's (1917) collection of
Tagalog[2] texts. The first step in the project was to analyze the
phonology and morphology of Tagalog and develop the rules and lexicon
files for PC-KIMMO. The phonology and morphology of Tagalog are rather
complex. Verbs in particular exhibit a considerable amount of both
derivational and inflectional morphology. One of the more exotic
features of Tagalog morphology is its pervasive use of infixes and
reduplication. For example, the root lçkad is made into a verb by
placing the infix um after the first consonant of the root to produce
lumçkad. The durative aspect of this verb is signaled by reduplicating
the first consonant and vowel of the root to produce lçlçkad. The two
processes can be combined to produce lumçlçkad. In addition to this
morphological complexity, at least a dozen rules are required to
account for various morphophonemic processes, including coalescence,
stress shift, and syncope. For example, the underlying form bilô+in is
realized as the surface form bilhôn. In the two-level model, these
forms are related like this:
UF: b i l ô 0 + i n
SF: b i l 0 h 0 ô n
Rules are required to account for the syncopation of ô, the insertion
of h, and the shift of stress from the last syllable of the root to
the suffix.
After the rules and lexicon had been written and tested using
PC-KIMMO, the next step was to keyboard the chosen text. The first
paragraph of the text is shown in figure 1.
Figure 1 Fragment of a Tagalog text
\ti Aû ulÿl na uûgÿ at aû mar£noû na pagÿû.
\p
\s MÆnsan aû pagÿû hçbaû nalôlÆgo sa Ælog, ay nakêkÆta syê
naû isa_û p£no_û-sçgiû na lum¥l£taû at tinçtaûêy naû çgos.
\s HinÆla niya sa pasÆgan, dçtapwat hindÆ nya madalê sa l£paq.
\s Dçhil dÆto tinçwag nya aû kaybÆgan niya_û uûgÿq at iniyçlay
nyê aû kap£tol naû p£no_û-sçgiû kuû itçtanim nyê aû kanyê_û
kapartÅ.
\s Tumaûÿq aû uûgÿq at hinçte nilê sa gitnêq mulç sa magkçbila_û
d£lo aû p£no naû sçgiû.
\s Inaûkôn naû uûgÿ aû kap£tol na mçy maûa dçhon, dçhil sa
panukçlê nya na iyÿn ay t¥t£bo na mab£ti kçy sa kap£tol na wala_û
dçhon.
The text was keyboarded using a very simple system of document markup
that tags parts of the document with backslash codes. The \ti tag
indicates the title of the story, the \p tag indicates the beginning
of a paragraph, and the \s tag indicates the beginning of a sentence.
A few small adjustments to the original transcription were made. For
instance, where Bloomfield wrote enclitics separate from the preceding
word, they have been joined with the underline character: isa_ng.
The next step was to process the keyboarded text with KTEXT. A
fragment of the resulting output file is shown in figure 2.
Figure 2 Output of KTEXT
\a < DET S >
\d aû
\w \\ti
\c 1
\a < AJ foolish >
\d ulÿl
\w ulÿl
\a %2%< PRT LKR >%< PRT ENC >%
\d %2%na%nê%
\w na
\a < N1 monkey >
\d uûgÿq
\w uûgÿ
\a < CNJ and >
\d at
\w at
\a < DET S >
\d aû
\w aû
\a AJR < N2 wisdom >
\d ma-d£noû
\w mar£noû
\a %2%< PRT LKR >%< PRT ENC >%
\d %2%na%nê%
\w na
\a < N1 turtle >
\d pagÿû
\w pagÿû
\n .\n\n
This is as far as KTEXT takes us. What you do with KTEXT's output is
limited only by your imagination and ingenuity. One obvious way to
continue is to reassemble the text in interlinear format. That is, we
could write a program that would take the data structures shown in
figure 2 and create a new file where the text is stored in interlinear
format. The resulting interlinear text is shown in figure 3. An
interlinear text editor like IT[3] could then be used to add more lines
of annotations to the text.
Figure 3 A Tagalog example of interlinear text format
Ang  ulÿl     na   unggÿ   at   ang  mar£nong    na   pagÿng.
ang  ulÿl     na   unggoq  at   ang  ma- d£nong  na   pagÿng
S    foolish  LKR  monkey  and  S    AJR-wisdom  LKR  turtle
Interlinear translation is a time-honored format for presenting
analyzed vernacular texts. An interlinear text consists of a baseline
text and one or more lines of annotations that are vertically aligned
with the baseline. In the text shown in figure 3, the first line is
the baseline text. The second line provides the lexical form of each
original word, including morpheme breaks. The third line gives the
gloss of each word or morpheme. Grammatical morphemes are glossed with
abbreviations in all capital letters and lexical morphemes are glossed
with equivalent English words. For instance, the word mar£nong in the
first line is written as two morphemes in the second line: ma-d£nong
(notice the phonological alternation between d and r). The third line
gives its gloss, AJR-wisdom, where AJR stands for an adjectivizer
prefix that changes the noun stem d£nong 'wisdom' into an adjective
meaning 'wise'.
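To make the idea of reassembly concrete, the following Python sketch builds the three aligned lines of figure 3 from records read with the read_records() function sketched in section 1.1. The simple space-padding, the choice of fields, and the file name tagalog.ana are assumptions; real interlinearizing tools such as IT and ITF do far more.

def interlinearize(records):
    words   = [rec.get("\\w", [""])[0] for rec in records]
    decomps = [rec.get("\\d", [""])[0] for rec in records]
    glosses = [rec.get("\\a", [""])[0] for rec in records]
    lines = ["", "", ""]
    for w, d, g in zip(words, decomps, glosses):
        width = max(len(w), len(d), len(g)) + 2   # pad each column
        lines[0] += w.ljust(width)
        lines[1] += d.ljust(width)
        lines[2] += g.ljust(width)
    return "\n".join(line.rstrip() for line in lines)

print(interlinearize(read_records("tagalog.ana")))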
Another way to proceed would be to take the output of KTEXT as shown
in figure 2 and format it directly for printing. In other words, there
would be no disk file of interlinear text corresponding to figure 3;
rather, the interlinear text is created on the fly as it is prepared
for printing. Fortunately, the software required to print interlinear
text is now available. As a complement to the IT program, a system for
formatting interlinear text for typesetting has recently been
developed (see Kew and McConnel, 1991). Called ITF, for Interlinear
Text Formatter,[4] it is a set of TEX[5] macros that can format an
arbitrary number of aligning annotations with up to two freeform
(nonaligning) annotations. While ITF is primarily intended to format
the data files produced by IT (similar to the interlinear text shown
in figure 3), an auxiliary program provided with ITF accepts the
output of the KTEXT program. The final printed result of the
formatting process is shown in figure 4.[6] It should be noted that this
is just one of many formats that ITF can produce. Because ITF is built
on a full-featured typesetting system, virtually all aspects of the
formatting detail can be customized, including half a dozen different
schemes for laying out the freeform annotations relative to the
interlinear text.
3 Running KTEXT
This section describes KTEXT's user interface and the input files it
uses.
KTEXT is a batch-processing program. This means that the program takes
as input a text from a disk file and returns as output the processed
text in a new disk file. KTEXT is run from the command line by giving
it the information it needs (file names and other options). It does
not have an interactive interface. The user controls KTEXT's operation
by means of special files that contain all the information KTEXT needs
to process the input text. These files are called control files. Here
is an example of running KTEXT on an English text (an excerpt from
Lewis Carroll's Alice's Adventures in Wonderland). At the operating
system prompt, type "ktext" plus various command line options:
C:\>ktext -w -x english.ctl -i alice.txt -o alice.ana -l alice.log
The following will appear on the screen:
KTEXT TWO-LEVEL PROCESSOR
Version 0.9.4 (11 March 1991), Copyright 1991 SIL
Using the following as word-formation characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-'
Rules being loaded from english.rul
Lexicon being loaded from english.lex
..................................................................
..................................................................
............
Each dot represents one word successfully processed. When the program
is done, it will return you to the operating system prompt.
To see a list of the command line options, type "ktext -h". You will
see a display similar to this:
-c <char>     make <char> the comment character (default is ;)
-t            set tracing on (default is off)
-w            include \w field in output file (default is no \w field)
-x <ctlfile>  specify the control file name (default is ktext.ctl)
-i <infile>   specify the input data file name
-o <outfile>  specify the output file name
-l <logfile>  specify the log file name (default is none)
The command line options (-w, -x, and so on) are all lower case
letters. Here is a detailed description of each command line option.
-c The -c option takes an argument that sets the comment character
used in the PC-KIMMO rules and lexicon files. It has no effect on any
other files used by KTEXT except these two. If the -c option is not
used, the default PC-KIMMO comment character is used, namely semicolon
(;).
-t The -t option turns the PC-KIMMO tracing mechanism on. This
displays on the screen everything the parser is doing when it
processes a word. Tracing is used for debugging the rules and lexicon,
and is better used with the PC-KIMMO shell program.
-w The -w option causes the \w field to be included in each word
record of the output file. The \w field contains the original word
from the text. If you don't include the -w option, the word records of
the output file will contain only the \a (analysis) and \d (morpheme
decomposition) fields.
-x The -x option takes an optional argument that specifies the name
of the main KTEXT control file. This main control file contains the
name of the TXTIN control file and the names of the rules and lexicon
files. It can also specify consistent changes to be made to the output
fields. The -x option accepts a default file name extension of CTL;
for example, if you use "-x english" KTEXT will try to load the file
"english.ctl". If the -x option is not used, KTEXT will try to load a
control file with the default file name KTEXT.CTL.
-i The -i option takes an obligatory argument that specifies the name
of the input file containing the text that KTEXT will process. If the
-i option is not used, KTEXT will prompt you to enter the name of the
input file.
-o The -o option takes an obligatory argument that specifies the name
of the output file that KTEXT creates. If the -o option is not used,
KTEXT will prompt you to enter the name of the output file. If a file
with the same name already exists, KTEXT will ask for confirmation
that you want to overwrite it.
-l The -l option takes an obligatory argument that specifies the name
of a log file. The log file will contain any analysis failures or
other anomalous behavior during processing of the input text.
In all instances where file names are supplied to KTEXT, an optional
directory path can be included; for example, -i c:\texts\alice.txt.
4 KTEXT's functional structure
KTEXT has two main functional modules: the TXTIN module and the
ANALYSIS module. The diagram in figure 5 shows the flow of data
through these modules. The input text is fed into the TXTIN module
which outputs the text as a stream of normalized words with
capitalization and punctuation stripped out and saved. The TXTIN
module also uses a control file that specifies orthographic changes.
Each word is then passed to the ANALYSIS module where it is parsed and
output as a database record. The ANALYSIS module also uses the
PC-KIMMO rules and lexicon files.
Figure 5 Functional structure of KTEXT
                          input text
                              |
                              |
                 +------------------------+
                 |            |           |
                 |    +--------------+    |
 text input      |    |              |    |
 control file -->|    |    TXTIN     |    |-----> punctuation
                 |    |              |    |       white space
                 |    +--------------+    |       capitalization
                 |            |           |       format marking
                 |          words         |
                 |            |           |
                 |    +--------------+    |
 rules and       |    |              |    |
 lexicon files ->|    |   ANALYSIS   |    |
                 |    |              |    |
                 |    +--------------+    |
                 |            |           |
                 +------------------------+
                              |
                              |
                        parsed output
KTEXT uses five input files and produces one output file (plus an
optional log file). These five input files are:
the text data file,
the main control file,
the TXTIN control file,
the PC-KIMMO rules file, and
the PC-KIMMO lexicon file.
The PC-KIMMO rules and lexicon files are described in the PC-KIMMO
book (Antworth 1990) and will not be discussed further in this
document. The other input files and the output data file are described
in the following sections.
5 The text data file
The text data file contains the text that KTEXT will process. It must
be a plain text file, not a file formatted by a word processor. If you
use a word processor such as Microsoft Word to create your text, you
must save it as plain text with no formatting. KTEXT preserves all the
"white space" used in the text file. That is, it saves in its output
file the location of all line breaks, blank lines, tabs, spaces, and
other nonalphabetic characters. This enables you to recover from the
output file the precise format and page layout of the original text.
While KTEXT will accept text with no formatting information other than
white space characters, it will also handle text that contains special
format markers. These format markers can indicate parts of the text
such as sentences, paragraphs, sections, section headings, and titles.
The use of special format markers is called descriptive markup. KTEXT
(because it is based on AMPLE) works best with a system of descriptive
markup called "standard format" that is used by the Summer Institute
of Linguistics. SIL standard format marks the beginning of each text
unit with a format marker. There is no explicit indication of the end
of a unit. A format marker is composed of a special character (a
backslash by default) followed by a code of one or more letters. For
example, \ti for title, \ch for chapter, \p for paragraph, \s for
sentence, and so on. KTEXT does not "know" any particular format
markers. You can use whatever markers you like, as long as you declare
them in the TXTIN control file. For more on format markers, see
section 7.2.2 below.
One of the best known systems of descriptive markup is SGML (Standard
Generalized Markup Language). One very significant difference between
SGML and SIL standard format is that SGML uses markers in pairs, one
at the beginning of a text unit and a matching one at the end. This
should not pose a problem for KTEXT, since KTEXT just preserves all
format markers wherever they occur. Another difference is that SGML
flags format markers with angle brackets, for instance <paragraph>.
KTEXT can recognize SGML markers by changing the format marker flag
character from backslash to left angled bracket (see section 7.2.2
below). Recognizing the end of the SGML format marker is a bit of a
problem. While SGML uses a matching right angled bracket to indicate
the end of the marker, SIL standard format simply uses a space to
delineate the format marker from the following text. This means that
for KTEXT to find the end of an SGML tag, you must leave at least one
space after it.
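For example, if the format flag character has been changed to the left angle bracket, the two invented input lines below behave differently:
<s> The turtle planted the tree.
<s>The turtle planted the tree.
In the first line <s> is recognized as a format marker and the remaining words are parsed; in the second line the whole string <s>The is taken to be the format marker, so the word The is never analyzed.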
6 The main control file
The main control file controls various aspects of KTEXT's operation.
It is structured as a standard format database, composed of various
fields marked by backslash codes. Figure 6 shows the fields available
in the main control file:
Figure 6 Main control file field codes
Code        Description
---------   -----------------------
\textin     name of text control file
\rules      name of PC-KIMMO rules file
\lexicon    name of PC-KIMMO lexicon file
\ach        change in \a field
\dch        change in \d field
\scl        string class definition
The use of the first three fields listed above is straightforward. The
\textin field specifies the name of the text control file described
below in section 7. The \rules and \lexicon fields specify the names
of the PC-KIMMO data files. For example, a main control file for
Tagalog may contain these lines:
\textin tagtxtin.ctl
\rules tag.rul
\lexicon tag.lex
The next two fields, \ach and \dch, require more comment. These fields
allow you to make consistent changes in the contents of the \a and \d
fields before they are written to the output file. It works like this:
the ANALYSIS module processes an input word from the text and returns
its gloss and lexical form in \a and \d fields. KTEXT then applies any
changes that have been specified in \ach and \dch fields and then
writes the results to the output file. For example, the Tagalog main
control file may contain these lines:
\dch "I-" "in-"
\dch "U-" "um-"
The parser returns the lexical forms I- and U-, which is how they are
found in the PC-KIMMO Tagalog lexicon (these are essentially special
symbols representing infixes). The \dch fields change these forms into
in- and um-, which is their typical phonological shape. The changes
can also be restricted to apply only in certain environments. The \ach
and \dch fields work identically to the \ch fields used in the text
control file, described in detail in section 7.1.
The last field in figure 6 above is the \scl field, which is a string
class definition field. It allows you to define a special symbol to
stand for a set of characters; for instance, this string class field
defines the symbol Vowel to stand for the set of vowels:
\scl Vowel a e i o u
The symbol Vowel can then be used in the environments of \ach and \dch
fields. String class definitions are described in detail in section
7.1.7.
When KTEXT reads the main control file, it ignores any lines beginning
with field codes other than those listed in figure 6. For example, a
line beginning \co would be ignored. Such lines are treated as
comments. Comments in the control file can also be indicated with the
comment character, which by default is semicolon. This is the only way
to place comments on the same line as a field. The comment character
can be changed with the command line option -c when running KTEXT (see
section 3). The main control file must use the same comment character
as the rules and lexicon files.
The following shows a sample main control file.
\id tag.ctl - KTEXT main control file for Tagalog, 7-Mar-91
; select the various other files
\textin tagtxtin.ctl
\rules tag.rul
\lexicon tag.lex
; fix up some underlying forms
\dch "I-" "in-"
\dch "U-" "um-"
7 The TXTIN control file[7]
7.1 Text orthography changes
7.1.1 Basic changes
7.1.2 Environmentally constrained changes
7.1.3 Using text orthography changes
7.1.4 Where orthography changes apply
7.1.5 A sample orthography change table
7.1.6 Orthography change (\ch)
7.1.7 String class definition (\scl)
7.2 Words or format markers?
7.2.1 Word formation characters (\wfc)
7.2.2 Primary format marker character (\format)
7.2.3 Secondary format marker character (\barchar)
7.2.4 Single character bar codes (\barcodes)
7.3 Selecting fields
7.3.1 Fields to exclude (\excl)
7.3.2 Fields to include (\incl)
7.4 Special output characters
7.4.1 Ambiguity marker (\ambig)
7.4.2 Morpheme decomposition separator (\dsc)
7.5 Controlling capitalization
7.6 A sample text input control file
The TXTIN module applies to a text, splitting off the punctuation,
format marking, white space (space, tab, carriage return), and
capitalization information. It passes just the words of the text on to
the ANALYSIS module, in a normalized, lower case form after making any
user-specified orthographic changes. The TXTIN module requires three
types of control information.
(1) To identify words, TXTIN must know what letters make up words. It
assumes that the alphabetic characters (a to z, upper and lower case)
are used to make words; these are called the standard word formation
characters. In addition to these there may be characters like tilde
(~) and apostrophe (') in words like can~on (Spanish), don't (English),
etc. These are called nonstandard word formation characters.
(2) It is desirable to apply KTEXT directly to texts in their
practical orthographies, but to maintain the files the parser needs in
a more linguistically-appropriate orthography. For example, Latin x
can be converted to ks; Quechua long vowels, represented practically
by doubling the vowel, can be converted to a single vowel followed by
a colon (i.e., aa is converted to a:); and Campa nasals occurring
before a noncontinuant can be represented as the morphophoneme N,
unspecified for point of articulation (i.e., mp is converted to Np).
This kind of change is made possible by the text input orthography
changes, the rules defined for changing the orthography.
(3) KTEXT incorporates rather specific ideas about how formatting
information is given in texts. The details of how format markers are to
be distinguished from the words of the text are supplied by the user as
special formatting information. The text input control file influences how
KTEXT reads the input text files, and, to some degree, the format of
the output analysis files. Like the other input control files, it is
structured as a standard format database file. Figure 7 shows the
fields available in the text control file:
Figure 7 Text input control file field codes
Code        Description
---------   -----------------------
\ambig      analysis output ambiguity marker
\barchar    secondary format marker
\barcodes   single character bar codes
\ch         orthography change
\dsc        morpheme decomposition separator
\excl       fields to exclude
\format     primary format marker
\incl       fields to include
\luwfc      lower-upper word formation characters
\noincap    disable word-internal capitalization
\scl        string class definition
\wfc        word formation characters
When KTEXT reads the text input control file, it ignores any lines
beginning with field codes other than those listed in figure 7. For
example, a line beginning \co would be ignored. Such lines are treated
as comments. Comments in the control file can also be indicated with
the comment character, which by default is semicolon. This is the only
way to place comments on the same line as a field. The comment
character can be changed with the command line option -c when running
KTEXT (see section 3). The main control file must use the same comment
character as the rules and lexicon files.
7.1 Text orthography changes
7.1.1 Basic changes
To substitute one string of characters for another, the two strings
must be made known to the program by means of a change. (The technical
term for this sort of
change is a production, but we will simply call them changes.) In the
simplest case, a change is given in three parts: (1) the field code
\ch must be given at the extreme left margin to indicate that this
line contains a change; (2) the match string is the string for which
KTEXT must search; and (3) the substitution string is the replacement
for the match string, wherever it is found.
The beginning and end of the match and substitution strings must be
marked. The first printing character following \ch (with at least one
space or tab between) is used as the delimiter for that line. The
match string is taken as whatever lies between the first and second
occurrences of the delimiter on the line and the substitution string
is whatever lies between the third and fourth occurrences. For
example, the following lines indicate the change of hi to bye, where
the delimiters are the double quote mark ("), the single quote mark
('), the period (.), and the at sign (@).
\ch "hi" "bye"
\ch 'hi' 'bye'
\ch .hi. .bye.
\ch @hi@ @bye@
Throughout this document, we use the double quote mark as the
delimiter unless there is some reason to do otherwise.
Change tables follow these conventions:
(1) Any characters (other than the delimiter) may be placed between
the match and substitution strings. This allows various notations to
symbolize the change. For example, the following are equivalent:
\ch "thou" "you"
\ch "thou" to "you"
\ch "thou" > "you"
\ch "thou" --> "you"
\ch "thou" becomes "you"
(2) Comments included after the substitution string are initiated by
a semicolon (;), or whatever is indicated as the comment character by
means of the -c option when KTEXT is started. The following lines
illustrate the use of comments:
\ch "qeki" "qiki" ; for cases like wawqeki
\ch "thou" "you" ; for modern English
(3) A change can be ignored temporarily by turning it into a comment
field. This is done either by placing an unrecognized field code in
front of the normal \ch, or by placing a semicolon (;) in front of it
(the default comment character). For example, only the first of the
following three lines would effect a change:
\ch "nb" "mp"
\no \ch "np" "np"
;\ch "mb" "nb"
KTEXT applies a change table as an ordered set of changes. The first
change is applied to the entire word by searching from left to right
for any matching strings and, upon finding any, replacing them with
the substitution string. After the first change has been applied to
the entire word, then the next change is applied, and so on. Thus,
each change applies to the result of all prior changes. When all the
changes have been applied, the resulting word is returned. For
example, suppose we have the following changes:
\ch "aib" > "ayb"
\ch "yb" > "yp"
Consider the effect these have on the word paiba. The first changes i
to y, yielding payba; the second changes b to p, to yield paypa. (This
would be better than the single change of aib to ayp if there were
sources of yb other than the output of the first rule.)
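The following minimal Python sketch (not KTEXT's own code) shows the effect of this strictly ordered application; the protect-and-restore trick described next depends on exactly this behavior.

def apply_changes(word, changes):
    # Each change is applied to the whole word, and later changes see
    # the output of earlier ones.
    for match, subst in changes:
        word = word.replace(match, subst)
    return word

changes = [("aib", "ayb"), ("yb", "yp")]
print(apply_changes("paiba", changes))   # prints "paypa"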
The way in which change tables are applied by KTEXT allows certain
tricks. For example, suppose that for Quechua, we wish to change hw to
f, so that hwista becomes fista and hwis becomes fis. However, we do
not wish to change the sequence shw or chw to sf or cf (respectively).
This could be done by the following sequence of changes. (Note, @ and
$ are not otherwise used in the orthography.)
\ch "shw" > "@" ; (1)
\ch "chw" > "$" ; (2)
\ch "hw" > "f" ; (3)
\ch "@" > "shw" ; (4)
\ch "$" > "chw" ; (5)
Lines (1) and (2) protect the sh and ch by changing them to
distinguished symbols. This clears the way for the change of hw to f
in (3). Then lines (4) and (5) restore @ and $ to sh and ch,
respectively. (An alternative, simpler way to do this is discussed in
the next section.)
7.1.2 Environmentally constrained changes
It is possible to impose string environment constraints (SEC's) on
changes in the orthography change tables. The syntax of SEC's is
described in detail in section 7.2.
For example, suppose we wish to change the mid vowels (e and o) to
high vowels (i and u respectively) immediately before and after q.
This could be done with the following changes:
\ch "o" "u" / _ q / q _
\ch "e" "i" / _ q / q _
This is not entirely a hypothetical example; some Quechua practical
orthographies write the mid vowels e and o. However, in the
environment of /q/ these could be considered phonemically high vowels
/i/ and /u/. Changing the mid vowels to high upon loading texts has
the advantage that--for cases like upun `he drinks' and upoq `the one
who drinks'--the root needs to be represented internally only as upu
`drink'. But note, because of Spanish loans, it is not possible to
change all cases of e to i and o to u. The changes must be
conditioned.
In reality, the regressive vowel-lowering effect of /q/ can pass over
various intervening consonants, including /y/, /w/, /l/, /ll/, /r/,
/m/, /n/, and /n~/. For example, /ullq/ becomes ollq, /irq/ becomes
erq, etc. Rather than list each of these cases as a separate
constraint, it is convenient to define a class (which we label
+resonant) and use this class to simplify the SEC. Note that the
string class must be defined (with the \scl field code) before it is
used in a constraint.
\scl +resonant y w l ll r m n n~
\ch "o" "u" / q _ / _ ([+resonant]) q
\ch "e" "i" / q _ / _ ([+resonant]) q
This says that the mid vowels become high vowels after /q/ and before
/q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/,
or /n~/.
Consider the problem posed for Quechua in the previous section, that
of changing hw to f. An alternative is to condition the change so that
it does not apply adjacent to a member of the string class Affric,
which contains c and s:
\scl Affric c s
\ch "hw" "f" / ~[Affric] _
It is sometimes convenient to make certain changes only at word
boundaries, that is, to change a sequence of characters only if they
initiate or terminate the word. This conditioning is easily expressed,
as shown in the following examples.
\ch "this" "that" ; anywhere in the word
\ch "this" "that" / # _ ; only if word initial
\ch "this" "that" / _ # ; only if word final
\ch "this" "that" / # _ # ; only if entire word
7.1.3 Using text orthography changes
The purpose of orthography change is to convert text from an external
orthography to an internal representation more suitable for
morphological analysis. In many cases this is unnecessary, the
practical orthography being completely adequate as KTEXT's internal
representation. In other cases, the practical orthography is an
inconvenience that can be circumvented by converting to a more
phonemic representation.
Let us take a simple example from Latin. In the Latin orthography, the
nominative singular masculine of the word king is rex. However,
phonemically, this is really /reks/; /rek/ is the root meaning king
and the /s/ is an inflectional suffix. If KTEXT is to recover such an
analysis, then it is necessary to convert the x of the external,
practical orthography into ks internally. This can be done by
including the following orthography change in the text input control
file:
\ch "x" "ks"
In this, x is the match string and ks is the substitution string, as
discussed in section 7.1.1. Whenever x is found, ks is substituted for it.
Let us consider next an example from Huallaga Quechua. The practical
orthography currently represents long vowels by doubling the vowel.
For example, what is written as kaa is /ka:/ 'I am', where the length
(represented by a colon) is the morpheme meaning 'first person
subject'. Other examples, such as upoo /upu:/ 'I drink' and upichee
/upi-chi-:/ 'I extinguish', motivate us to convert all long vowels
into a vowel followed by a colon. The following changes do this:
\ch "aa" "a:"
\ch "ee" "i:"
\ch "ii" "i:"
\ch "oo" "u:"
\ch "uu" "u:"
Note that the long high vowels (i and u) have become mid vowels (e and
o respectively); consequently, the vowel in the substitution string is
not necessarily the same as that of the match string. What is the
utility of these changes? In the lexicon, the morphemes can be
represented in their phonemic forms; they do not have to be
represented in all their orthographic variants. For example, the first
person subject morpheme can be represented simply as a colon (-:),
rather than as -a in cases like kaa, as -o in cases like qoo, and as
-e in cases like upichee. Further, the verb 'drink' can be
represented as upu and the causative suffix (in upichee) can be
represented as -chi; these are the forms these morphemes have in other
(nonlowered) environments. For the next example, let us suppose that we
are analyzing Spanish, and that we wish to work internally with k
rather than c (before a, o, and u) and qu (before i and e). (Of
course, this is probably not the only change we would want to make.)
Consider the following changes:
\ch "ca" "ka"
\ch "co" "ko"
\ch "cu" "ku"
\ch "qu" "k"
The first three handle c and the last handles qu. By virtue of
including the vowel after c, we avoid changing ch to kh. There are
other ways to achieve the same effect. One way exploits the fact that
each change is applied to the output of all previous changes. Thus, we
could first protect ch by changing it to some distinguished character
(say @), then changing c to k, and then restoring @ to ch:
\ch "ch" "@"
\ch "c" "k"
\ch "@" "ch"
\ch "qu" "k"
Another approach conditions the change by the adjacent characters. The
changes could be rewritten as
\ch "c" "k" / _a / _o / _u ; only before a, o, or u
\ch "qu" "k" ; in all cases
The first change says, "change c to k when followed by a, o, or u."
(This would, for example, change como to komo, but would not affect
chal.) The syntax of such conditions is exactly that used in string
environment constraints; see section 7.2.
7.1.4 Where orthography changes apply
Orthography changes are useful when the text being analyzed is
written in a practical orthography. Rather than requiring that it be
converted as a prerequisite to morphological analysis, it is possible
to have KTEXT convert the orthography of each word as it loads the
text, before any analysis is performed.
The changes loaded from the text input control file are used in the
module TXTIN, after all the text is converted to lower case (and the
information about upper and lower case, along with information about
format marking, punctuation and white space, has been put to one
side). Consequently, the match strings of these orthography changes
should be all lower case; any change that has an uppercase character
in the match string will never apply.
7.1.5 A sample orthography change table
We include here the entire orthography input change table for Caquinte
(Campa). There are basically four changes that need to be made: (1)
nasals, which in the practical orthography reflect their assimilation
to the point of articulation of a following noncontinuant, must be
changed into an unspecified nasal, represented by N; (2) c and qu are
changed to k; (3) j is changed to h; and (4) gu is changed to g before
i and e.
Figure 8 Caquinte orthography change table
\ch "mp" "Np" ; for unspecified nasals
\ch "nch" "Nch"
\ch "nc" "Nk"
\ch "nqu" "Nk"
\ch "nt" "Nt"
\ch "ch" "@" ; to protect ch
\ch "c" "k" ; other c's to k
\ch "@" "ch" ; to restore ch
\ch "qu" "k"
\ch "j" "h"
\ch "gue" "ge"
\ch "gui" "gi"
This change table can be simplified by the judicious use of string
environment constraints:
Figure 9 Simplified Caquinte orthography change table
\ch "m" > "N" / _p
\ch "n" > "N" / _c / _t / _qu
\ch "c" > "k" / _~h
\ch "qu" > "k"
\ch "j" > "h"
\ch "gu" > "g" / _e /_i
7.1.6 Orthography change (code \ch)
As suggested by the preceding examples, the text orthography change
table is composed of all the \ch fields found in the text input
control file. These may appear anywhere in the file relative to the
other fields. It is recommended that all the orthography changes be
placed together in one section of the text input control file, rather
than being mixed in with other fields.
7.1.7 String class definition (code \scl)
String classes are defined using the \scl field code. The members of
string classes are literal strings or single characters. Any number of
string classes may be defined, and any class may contain any number of
strings. These strings may be of any length, although they usually
represent phonological segments. String class names can be used in the
string environment constraints of following changes.
String classes must be defined before being used. For example, the
first two lines of the Caquinte example above could be given as
follows:
\scl -bilabial c t qu
\ch "m" > "N" / _ p
\ch "n" > "N" / _ [-bilabial]
The string class definition could be in the main control file: string
classes defined there can be used in the text input control file as
well.
7.2 Words or format markers?
KTEXT may sometimes be applied to a pure text file, such as a
wordlist. Usually, however, there is formatting information (i.e.,
punctuation and some sort of descriptive markup) mixed in with the
words. KTEXT needs to differentiate between the words and everything
else in the input text file. The fields discussed in this section
allow the user to inform KTEXT how to recognize words and how to
recognize formatting information.
7.2.1 Word formation characters (codes \wfc and \luwfc)
To break a text into words, KTEXT needs to know which characters are
used to form words. It always assumes that the letters A to Z and a to
z will be used as word formation characters. (Note that uppercase
letters are converted to lowercase letters when KTEXT reads a text
file.) If the orthography of the language the user is working in uses
any other characters, these must be given in a \wfc field in the text
input control file. For example, Quechua uses tilde (~) and an accent
mark ('). This information is provided by the following example:
\wfc ~ ; needed for words like nin~o
\wfc ' ; needed for words like papa'
Notice that the characters may be separated by spaces, although it is
not required to do so. If more than one \wfc field occurs in the text
input control file, KTEXT uses the combination of all characters
defined in all such fields as word formation characters.
The \wfc field is also used to declare accented (or eight bit)
characters, such as those available in the IBM extended character set.
For example,
\wfc ç Ä Æ ù £ ê Å ô ÿ ¥ û ä
KTEXT automatically converts the upper case characters A-Z to their
equivalent lower case characters a-z. You can also declare other pairs
of characters as lower-to-upper case pairs. This is especially useful
when using accented characters (such as those available in the IBM
extended character set). Lower-to-upper case pairs are declared in a
field beginning with the code \luwfc (for "lower-upper word formation
characters"). For each following pair of characters, the first
character is the lower case equivalent of the second (which is assumed
to be upper case). Several such pairs can be placed in the field or
they may be placed in separate fields. Whitespace can be used in the
field freely. Characters that are declared in a \luwfc field do not also
have to be included in a \wfc field. For example,
\luwfc Ä â ƒ å û ä
After reading the text input control file, KTEXT reports the full set
of word formation characters being used. This is what KTEXT would
report for the Quechua example above:
Using the following as word-formation characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~'
The comment character (normally ;) cannot be designated as a word
formation character. If the orthography includes semicolon (;), then a
different comment character must be defined with the -c command line
option when KTEXT is initiated; see section 3.
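The following Python sketch (again, not KTEXT's code) illustrates how a word formation character set is used to separate words from everything else and to normalize them to lower case. The extra characters ~ and ' correspond to the Quechua \wfc fields above; the example words are invented.

import string

WFC = set(string.ascii_letters) | {"~", "'"}

def split_words(line):
    words, other, current = [], [], ""
    for ch in line:
        if ch in WFC:
            current += ch
        else:
            if current:
                words.append(current.lower())   # normalize to lower case
                current = ""
            other.append(ch)                    # punctuation, spaces, etc.
    if current:
        words.append(current.lower())
    return words, other

print(split_words("Don't stop, nin~o!"))
# prints (["don't", 'stop', 'nin~o'], [' ', ',', ' ', '!'])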
7.2.2 Primary format marker character (code \format)
KTEXT has a simple view of format markers: they consist of one or more
contiguous characters beginning with a special flag character. The
default character initiating format markers is the backslash (\).
Thus, each of the following would be recognized as a format marker and
would not be analyzed by KTEXT:
\
\p
\sp
\xx(yes)
\very-long.and;muddled/format*marker,to#be$sure
If \ is used in the orthography, or some other character is used to
flag format markers, it is possible to change to another format flag
character with a \format field in the text input control file. This
field designates a single character to replace the default \. For
example, if the format markers in the text files begin with the at
sign (@), the following should be placed in the text input control
file:
\format @ ; format markers start with at sign
This would be used, for example, if the text contained format markers
like the following:
@
@p
@sp
@xx(yes)
@very-long.and;muddled/format*marker,to#be$sure
Note that format markers cannot have a space or tab embedded in them;
the first space or tab encountered terminates the format marker as far
as KTEXT is concerned.
If a \format field occurs in the text input control file without a
following character to serve for flagging format markers, then KTEXT
will not recognize any format markers and will try to parse everything
other than punctuation characters.
It makes sense to use the \format field only once in the text input
control file. If multiple \format fields do occur in the file, KTEXT
uses only the value given in the first one. KTEXT uses only the first
printing character following the \format field code. The same
character cannot be used for flagging both format markers in text
input files and comments in control input files. Thus, semicolon (;)
cannot normally be used to flag format markers.
One final note: the format character under discussion here applies
only to the input text files which are to be analyzed. It has
absolutely nothing to do with the use of backslash (\) to flag field
codes in the control files read by KTEXT.
7.2.3 Secondary format marker character (code \barchar)
In addition to the general format markers discussed above, KTEXT
assumes a secondary type of marker which has a very restricted form.
It consists of a flag character followed by a single character from a
list of known values. It is typically used to indicate type style,
such as bold, italics, and so on. This secondary flag character must
be different from the one associated with the \format field. Its
default value is the vertical bar (|), causing this type of format
marker to be frequently called a bar code. The following could be
valid (secondary) format markers and would not be analyzed by KTEXT:
|b
|i
|r
(These codes typically stand for bold, italics, and regular,
respectively.)
Consider the following two lines of input text:
\bgoodbye\r
|bgoodbye|r
Using the default definitions of KTEXT, the first line is considered
to be a single format marker, and provides nothing which the program
should try to parse. The second line, however, contains two format
markers, |b and |r, and the word goodbye which will be analyzed by
KTEXT.
If | is used in the orthography, or some other character is used to
flag format markers, the flag character can be changed with a \barchar
field in the text input control file. This field designates a single
character to replace the default |. For example, if this type of
format marker begins with the dollar sign ($), the following should be
placed in the text input control file:
\barchar $ ; "bar codes" start with $
This would cause KTEXT to consider the following to be valid format
markers:
$b
$i
$r
An empty \barchar field in the text input control file causes KTEXT to
not recognize any bar code format markers. Thus, the following field
effectively turns off special treatment of this style of format
marking:
\barchar ; no bar character
It makes sense to use the \barchar field only once in the text input
control file. If multiple \barchar fields do occur in the file, KTEXT
uses only the value given in the first one. KTEXT uses only the first
printing character following the \barchar field code. The same
character cannot be used for marking both bar codes in the text file
and comments in the input control files. Thus, semicolon (;) cannot
normally be used as the bar code marker.
7.2.4 Single character bar codes (code \barcodes)
In conjunction with the special format marking character discussed in
the previous section, the \barcodes field defines the individual
characters used in bar codes. These characters may be separated
by spaces or lumped together. Thus, the following two fields are
equivalent:
\barcodes abcdefg ; lumped together
\barcodes a b c d e f g ; separated
If more than one \barcodes field is provided in the text input control
file, KTEXT uses the combination of all characters defined in all such
fields. No check is made for repeated characters: the previous example
would be accepted without complaint despite the redundancy of the
second line.
The default value for the bar codes is bdefhijmrsuvyz. Therefore, if
the text input control file contains neither a \barchar nor a
\barcodes field, the following bar codes are considered to be
formatting information by KTEXT: |b, |d, |e, |f, |h, |i, |j, |m, |r,
|s, |u, |v, |y, and |z. These are exactly the codes recognized by the
SIL MS (Manuscripter) program.
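The decision KTEXT makes for each token can be pictured with the following Python sketch. The default values mirror those described above, but the sketch looks only at isolated tokens, whereas KTEXT also allows a bar code to be attached directly to a word, as in |bgoodbye|r.

FORMAT_CHAR = "\\"                     # default primary flag character
BAR_CHAR = "|"                         # default secondary flag character
BAR_CODES = set("bdefhijmrsuvyz")      # default bar codes

def classify(token):
    if token.startswith(FORMAT_CHAR):
        return "format marker"         # runs to the next space or tab
    if len(token) >= 2 and token[0] == BAR_CHAR and token[1] in BAR_CODES:
        return "bar code"
    return "word"

for token in ["\\p", "|b", "goodbye"]:
    print(token, "->", classify(token))
# prints: \p -> format marker, |b -> bar code, goodbye -> word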
7.3 Selecting fields
There are times when it is undesirable for KTEXT to analyze every
field of a text input file. For instance, texts often begin with
identification lines to record authorship and state of revision. There
is no reason why this information should be morphologically parsed. It
may not even be in the same language!
KTEXT considers a field of an input text file to be everything from
one format marker to the next (or to the end of the file). This is
different from the definition of fields in the input control files,
which require field codes to be at the beginning of a line. Even
though it seems a bit strange to mix the concepts of fields and format
marking, this has proven to be useful in practice. (However, the
structure of a formatted text may not look that different from the
types of database files used by KTEXT, especially if the text
approximates the style of descriptive markup.) In the next two
sections, we will discuss two fields for controlling what parts of a
file KTEXT applies to. It does not make sense to include both of these
in the same text input control file. The one which best fits the task
at hand must be chosen.
7.3.1 Fields to exclude (code \excl)
The \excl field excludes one or more fields from processing by KTEXT.
For example, to have KTEXT ignore everything in \co and \id fields,
the following line is included in the text input control file:
\excl \co \id ; ignore these fields
If more than one \excl field is found in the text input control file,
KTEXT keeps adding the contents of each field to the overall list of
text fields to exclude. This list is initially empty, and stays empty
unless the text input control file contains an \excl field. Thus,
KTEXT normally does not exclude any text fields from processing.
If the text input control file contains \excl fields, then only those
text fields are not processed. Every word in every text field not
mentioned explicitly in an \excl field will be analyzed.
7.3.2 Fields to include (code \incl)
The \incl field explicitly includes one or more text fields for
processing by KTEXT, excluding all other fields. For instance, to have
KTEXT process everything in \txt and \qt fields, but ignore everything
else, the following line is placed in the text input control file:
\incl \txt \qt ; process these fields
If more than one \incl field is found in the text input control file,
KTEXT keeps adding the contents of each field to the overall list of
text fields to process. This list is initially empty, and stays empty
unless the text input control file contains an \incl field.
If the text input control file contains \incl fields, then only those
text fields are processed. Every word in every text field not
mentioned explicitly in an \incl field will not be analyzed. Note that
KTEXT processes every text field in the text input files unless the
text input control file contains either an \excl or an \incl field.
One or the other is used to limit processing, but never both.
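The combined effect of the two fields can be summarized in a small Python sketch (an illustration only, not KTEXT's code): given the format marker that begins a text field, decide whether its words should be analyzed.

def should_analyze(field_marker, excl=frozenset(), incl=frozenset()):
    if incl:                  # \incl fields given: process only these
        return field_marker in incl
    if excl:                  # \excl fields given: process all but these
        return field_marker not in excl
    return True               # neither given: process everything

print(should_analyze("\\id", excl={"\\co", "\\id"}))    # prints False
print(should_analyze("\\txt", incl={"\\txt", "\\qt"}))  # prints True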
7.4 Special output characters
The last two fields provided in the text input control file change
certain special characters in the analysis output file. This may be
required by the orthography of the language to which KTEXT is being
applied.
7.4.1 Ambiguity marker (code \ambig)
The morphological analysis performed by KTEXT may result in multiple
parses, an ambiguity which the computer program cannot resolve. It is
also possible for KTEXT to fail altogether in trying to analyze a
word. These two possibilities are normally shown in the analysis
output file as follows:
\a %3%< N0 kay >%< V1 ka > IMP%< V1 ka > INF%
\a %0%qoyka:rala:may%
This works fine unless the percent sign (%) is used in the
orthography.
The \ambig field controls the character used to mark ambiguities and
failures in the analysis output file. For example, to use the hash
mark (#), the text input control file should include:
\ambig # ; % isn't good enough
This would cause the sample analysis to be output as follows:
\a #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF#
\a #0#qoyka:rala:may#
It makes sense to use the \ambig field only once in the text input
control file. If multiple \ambig fields do occur in the file, KTEXT
uses only the value given in the first one. If the text input control
file does not have an \ambig field, KTEXT uses the percent sign (%).
KTEXT uses only the first printing character following the \ambig
field code. The same character cannot be used for marking both
ambiguities in the analysis output file and comments in the input
control files. Thus, semicolon (;) cannot normally be used as the
ambiguity marker.
7.4.2 Morpheme decomposition separator (code \dsc)
KTEXT always includes the morpheme decomposition field in its output,
producing results like the following:
\a < V2 *qu > IN PLDIR POL 1O IMP
\d qo-yka-:ra-lla:-ma-y
\a %3%< N0 kay >%< V1 ka > IMP%< V1 ka > INF%
\d %3%kay%ka-y%ka-y%
Note that the allomorph strings in the decomposition (\d) field are
separated by dashes (-). This works fine unless the language uses the
dash in its orthography.
The \dsc field controls the character used to separate the morphemes
in the decomposition field. For example, one might use the equal sign
(=) by including the following in the text input control file:
\dsc = ; - is used by the orthography
This would cause the sample analysis to be output as follows:
\a < V2 *qu > IN PLDIR POL 1O IMP
\d qo=yka=:ra=lla:=ma=y
\a %3%< N0 kay >%< V1 ka > IMP%< V1 ka > INF%
\d %3%kay%ka=y%ka=y%
It makes sense to use the \dsc field only once in the text input
control file. If multiple \dsc fields do occur in the file, KTEXT uses
the value given in the first one. If the text input control file does
not have a \dsc field, KTEXT uses a dash (-). KTEXT uses only the
first printing character following the \dsc field code. The same
character cannot be used both for separating decomposed morphemes in
the analysis output file and for marking comments in the input control
files. Thus, one normally cannot use semicolon (;) as the
decomposition separation character.
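A program that reads the decomposition field must split on whichever
separator was chosen. A short Python sketch, again purely illustrative
and not taken from KTEXT:

# Illustration: splitting a decomposition into morphemes on the
# separator chosen by \dsc (the default is the hyphen).
def split_morphemes(decomposition, separator="-"):
    return decomposition.split(separator)

split_morphemes("qo-yka-:ra-lla:-ma-y")
# ['qo', 'yka', ':ra', 'lla:', 'ma', 'y']
split_morphemes("qo=yka=:ra=lla:=ma=y", separator="=")
# ['qo', 'yka', ':ra', 'lla:', 'ma', 'y']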
7.5 Controlling capitalization
KTEXT records the capitalization pattern of each word in the text
file. Besides the typical case of a word whose initial letter is
capitalized (because it is a proper noun or because it is the first
word in a sentence), there are two special cases: words with mixed
capitals and words in all capitals. First, for words with mixed
capitals (such as MacDonald), the capitalization of each letter is
recorded through the first thirteen letters of the word (this
limitation is due to the length of the bit field used to record
capitalization information). Second, words in all capitals are
specially marked as such and capitalization is recorded no matter how
long the word is.
Word-internal capitalization can be disabled by using the \noincap
option in the input text control file. This feature will likely only
be of use if you intend to translate KTEXT's output into another
language and you know that the internal recapitalization is likely to
be wrong.
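The capitalization record can be pictured as a small bit field plus a
flag for all-capitals words. The Python sketch below shows the idea
only; the actual numbers KTEXT writes in the \c field (the bit order
and the value used for all-capitals words) are not documented here, so
the encoding in the sketch is an assumption:

# Illustration of recording capitalization as a 13-letter bit field.
# The bit order (first letter = bit 0) is an assumption, not a
# statement about KTEXT's internal format.
MAX_LETTERS = 13    # the documented length limit of the bit field

def record_caps(word):
    bits = 0
    for i, ch in enumerate(word[:MAX_LETTERS]):
        if ch.isupper():
            bits |= 1 << i
    return bits

def restore_caps(lowercased, bits):
    return "".join(ch.upper() if (i < MAX_LETTERS and bits >> i & 1)
                   else ch
                   for i, ch in enumerate(lowercased))

restore_caps("macdonald", record_caps("MacDonald"))   # 'MacDonald'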
7.6 A sample text input control file
The following is the complete text input control file for Huallaga
Quechua:
\id HGTEXT.CTL - for Huallaga Quechua, 25-May-88
\co WORD FORMATION CHARACTERS
\wfc ' ~
\co FIELDS TO EXCLUDE
\excl \id ; identification fields
\co ORTHOGRAPHY CHANGES
\ch "aa" > "a:" ; for long vowels
\ch "ee" > "i:"
\ch "ii" > "i:"
\ch "oo" > "u:"
\ch "uu" > "u:"
\ch "qeki" > "qiki" ; for cases like wawqeki
\ch "~n" > "n~" ; for typos
; for Spanish loans like hwista
\scl sib s c ; sibilants
\ch "hw" > "f" / ~[sib]_
8 The output data file
KTEXT formats its output as a database, each record of which
corresponds to a word of the source text. The first field of each
entry contains the analysis, the second field the morpheme
decomposition, and the third field (which is optional, see section 3
on using the -w option) the original word. Other fields, which may or
may not occur in any given entry, contain information about the
capitalization of the word, format marking, punctuation, and white
space. The fields and their field codes are as shown in figure 10:
Figure 10 Field codes produced in the analysis
Code Description
------- ----------------
\a analysis
\d morpheme decomposition
\w original word
\f preceding format marks
\c capitalization
\n trailing nonalphabetics
For example, suppose that itçtanim (from a Tagalog input text)
analyzes unambiguously, and that the original word and the morpheme
decomposition are both requested. The resulting analysis file contains
the following lines:
\a IP DUR < V plant >
\d i-RE-tanôm
\w itçtanim
For some words, KTEXT discovers more than one possible analysis. We
call these ambiguities (or multiple parses). In this case, KTEXT puts
all the alternatives into the resulting analysis file separated by a
percent sign (%), and with a number to indicate how many there are.
For example, Quechua kay is a three-fold ambiguity:
\a %3%< N0 kay >%< V1 ka > IMP%< V1 ka > INF%
\d %3%kay%ka-y%ka-y%
\w kay
KTEXT may fail to analyze a word from the input text. Analysis
failures appear in the resulting analysis file surrounded by percent
signs and preceded by the number zero (0), as the following
illustrates:
\a %0%qoyka:rala:may%
\d %0%qoyka:rala:may%
\w qoykaaralaamay
If you use a log file (see section 3), it will record all instances of
analysis failures. To edit failures and ambiguities in the output
file, you can use a special editor called CED, which is described in
section 9.
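Since every record is just a run of backslash-coded fields, the analysis
file is easy for other programs to read. The following Python sketch is
an illustration only, not part of KTEXT; the filename is hypothetical,
and it assumes that each record begins with its \a field, as shown in
figure 10:

# Illustration: reading an analysis file into one dictionary per word.
def read_records(path):
    records, current = [], None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.startswith("\\"):
                continue                  # skip blank or stray lines
            code, _, value = line.partition(" ")
            if code == "\\a":             # \a opens a new word record
                current = {}
                records.append(current)
            if current is not None:
                current[code] = value
    return records

for record in read_records("mytext.ana"):     # hypothetical filename
    print(record.get("\\w"), "->", record.get("\\a"))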
As has been noted elsewhere, KTEXT has much in common with the program
AMPLE, whose text-processing routines KTEXT has borrowed. In order to
be able to use other software that expects AMPLE-style output, it is
desirable to understand how to reproduce it with KTEXT. There are
some features of KTEXT's output file that you cannot change, notably
the field code names and their order in a record. Indeed, to remain
compatible with AMPLE they should not be changed. But the actual
contents of the fields themselves depend entirely on the format of the
PC-KIMMO lexicon file and consistent changes specified in the
control files. For example, here is a record from the output file
produced by the English example (supplied with KTEXT):
\a V(be`gin)+PROG
\d be`gin+ing
\w beginning
And here is a record from the output file produced by the Tagalog
example (supplied with KTEXT):
\a IP DUR < V plant >
\d i-RE-tanôm
\w itçtanim
The Tagalog example conforms to AMPLE while the English example does
not. The salient features of AMPLE output are as follows.
(1) AMPLE requires every word to minimally contain a root. Even
particles that cannot take affixes are treated like roots.
(2) In the \a field of a word record, the root of a word is delimited
by angled brackets (<>). In the Tagalog example above, the root of the
word is < V plant >. Notice that the left bracket is followed by a
space and the right bracket is preceded by a space.
(3) Inside the angled brackets that delimit a root, there are exactly
two pieces of data: a word class (part of speech) abbreviation and a
gloss (or some other representation of the root, such as an underlying
form or protoform).
(4) Morpheme boundary symbols (such as hyphen) are not used in the \a
field. Prefix and suffix glosses are separated from each other and the
root gloss by spaces.
(5) In the \d field, only one morpheme boundary symbol is recognized;
by default it is hyphen (-), but this can be changed with the \dsc
field in the input text control file (see section 7.4.2).
There are two places where you can tweak KTEXT in order to make it
conform to these specifications: the lexicon file and the main control
file. The easiest way to get angled brackets around roots is simply to
include them in the glosses of all roots in the lexicon. (To be
absolutely safe, the brackets should be padded by one space.) For
example, here is the lexical entry for the Tagalog verb root tanim:
tanôm V_Root "< V plant >"
Inside the angled brackets of the root gloss are the word class
abbreviation V and the gloss 'plant'.
In a typical PC-KIMMO lexicon file, the glosses of affixes normally
contain a morpheme boundary symbol; for example:
pag- V_Prefix "VR1-"
where the - in the gloss VR1- indicates that it is a prefix. Such
glosses will incorrectly leave morpheme boundary symbols in the \a
field of the output word record. There are two ways to remove morpheme
boundary symbols from the \a field. First, replace them with spaces in
the lexicon file; for example:
pag- V_Prefix "VR1 "
Second, leave them in the lexicon file but use a \ach field in the
main control file to change them to spaces; for example:
\ach "-" " "
Your lexicon file may use more than one morpheme boundary symbol. For
example, the Tagalog example uses hyphen for prefixes and plus sign
for suffixes (the phonological rules require this distinction). But
the \d field will only recognize one boundary symbol. This can be
fixed by including a \dch field in the main control file that changes
plus sign to hyphen:
\dch "+" "-"
See the Tagalog lexicon file and main control file for more examples
of changes such as these.
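The effect of the \ach and \dch changes just described can be pictured
with one more sketch. KTEXT applies these changes itself; the Python
below, with made-up gloss strings, only illustrates what the
substitutions do:

# Illustration of the changes discussed above:
#   \ach "-" " "   turns boundary hyphens into spaces in the \a field
#   \dch "+" "-"   normalizes the suffix boundary to hyphen in \d
def apply_changes(text, changes):
    for old, new in changes:
        text = text.replace(old, new)
    return text

apply_changes("PREF1- < V root > -SUF1", [("-", " ")])
# 'PREF1  < V root >  SUF1'   (boundary symbols gone from \a)
apply_changes("pref-root+suf", [("+", "-")])
# 'pref-root-suf'             (one boundary symbol throughout \d)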
9 CED: an editor for failures and ambiguities[8]
9.1 Overview of CED
9.2 Starting the CED editor
9.2.1 Giving CED an input file with the -i option
9.2.2 Giving CED an output file with the -o option
9.2.3 Changing CED's ambiguity marker with the -a option
9.3 Editing for text glossing
9.4 The editing process
9.5 Command summary
9.5.1 Major commands
9.5.2 Word-edit commands
9.1 Overview of CED
Sometimes KTEXT fails to analyze a word into morphemes. Such words are
referred to as failures, and are flagged as such in the output. For
example, tatanpa is flagged as a failure in the following:
\a %0%tatanpa%
In other cases, KTEXT produces multiple analyses for a given word.
Such cases are referred to as ambiguities, and are flagged as such in
the output. For example, the Quechua word aywamunchu produces the
following output, indicating two possibilities:
\a %2%< V1 *aywa > AFAR 3 NEG%< V1 *aywa > AFAR 3 YN?%
Each failure or ambiguity begins with a percent sign (%) followed by
an integer. This integer represents the number of analyses: 0 (zero)
for a failure, 2 if there are two alternatives of an ambiguous word, 3
if there are three alternatives, etc. Each alternative is terminated
by a percent sign.
If a complete and unambiguous morphological analysis of a text is
needed, as would be the case for text glossing, then the analysis
produced by KTEXT should be edited to deal with the failures and the
ambiguities. CED is an editing program designed specifically for
dealing with only the flagged failures and ambiguities. (CED stands
for CADA Editor, CADA being an acronym for Computer Assisted Dialect
Adaptation.) CED has various virtues:
(1) It protects the user from unwanted changes. It allows
modification only of failures and ambiguities. Thus, CED is good for
users who are not familiar with a more general editing program, with
formatting conventions, etc. If needed, subsequent changes can be made
with a general-purpose editor.
(2) It is easy to learn. Anyone should be able to use CED with 20
minutes of orientation.
(3) It is safe for situations where electricity is unstable. It works
as a single pass (from the beginning to the end of the file), writing
the output as editing is done.
To learn CED, skim the remainder of
this chapter and then try the program. Don't be dismayed if you have
trouble visualizing everything described here; you can always come
back and read this after giving CED a try.
9.2 Starting the CED editor
CED is run by typing its name in response to the system prompt. After
it loads, it prompts for an input file. Suppose that you respond with
the filename xxxxxx.ana (followed, of course, by pressing the ENTER
key), and that CED finds the file. (If it does not find it, CED
requests the filename again.) After finding the input file, CED asks
for the name of an output file, proposing that it be named xxxxxx.CED
(where xxxxxx is from the input filename). If you wish some other name
(e.g., to write the output somewhere other than on the default
device), you may type the filename after that prompt. If you are
satisfied with CED's suggestion, simply respond by pressing the ENTER
key. (Note that the ENTER key may be labeled RETURN on some
keyboards.)
Rather than wait for CED's prompting, you can designate either the
input file or the output file (or both) in the command used to start
CED. You can also designate a different ambiguity marker character to
match the one given by an \ambig field in the text input control file.
A command using all of these options would look like the following
(user input is underlined):
C> ced -i infile.ana -o outfil.ced -a @
Each of these command line options is discussed below.
9.2.1 Giving CED an input file with the -i option
The name of the input file can be given as part of the command,
following the -i option. If CED is given an input file in this way, it
does not request an input filename. For example, the following two
interactions are equivalent in starting CED (user input is
underlined):
C> ced
CED (CADA Editor) version 2.0 (October 1988)
File to be edited: mytext.ana
or
C> ced -i mytext.ana
CED (CADA Editor) version 2.0 (October 1988)
9.2.2 Giving CED an output file with the -o option
The name of the output file can be given as part of the command,
following the -o option. If CED is given an input file in this way, it
does not request an output filename. For example, the following two
interactions are equivalent in starting CED (user input is
underlined):
C> ced
CED (CADA Editor) version 2.0 (October 1988)
File to be edited: mytext.ana
Name of output file: [mytext.ced] mytext.out
or
C> ced -o mytext.out
CED (CADA Editor) version 2.0 (October 1988)
File to be edited: mytext.ana
If an output file is not given with the -o option, CED proposes a name
based on the input filename, but asks for confirmation. If you want to
use the output filename shown enclosed in brackets, simply respond to
the prompt by pressing the ENTER key.
9.2.3 Changing CED's ambiguity marker with the -a option
KTEXT ordinarily flags failures and ambiguities in its output with a
percent sign (%):
\a %0%tatanpa%
\a %2%< V1 *aywa > AFAR 3 NEG%< V1 *aywa > AFAR 3 YN?%
However, this character can be changed, for example to the at sign
(@), by putting the following line in the text input control file:
\ambig @
In this case, output would look like the following:
\a @0@tatanpa@
\a @2@< V1 *aywa > AFAR 3 NEG@< V1 *aywa > AFAR 3 YN?@
If CED were to be run on such an analysis without informing it that
the flagging character is different, it would fail to recognize the
failures and ambiguities.
To cause CED to recognize a different flagging character, we must
include the -a option, followed by the new flagging character, when
the program is started. For example, to edit a text in which failures
and ambiguities are flagged with @, CED would be initiated as follows
(user input is underlined):
C> ced -a @
The -a option is compatible with the other command line options (-i
and -o), and may either precede or follow them.
In the examples given below, we will use % as the flagging character,
since it is the default.
9.3 Editing for text glossing
An analysis file used for text glossing should include morpheme
decomposition fields. Thus, every word has a pair of lines, one the
analysis, the other the decomposition. If the analysis failed, the \a
field contains the original word, and you must replace it with the
correct analysis. Further, the \d field also contains the original
word, and you must introduce hyphens (or some other separation
character) between the morphemes.
An analysis ambiguity looks like the following, where each analysis is
paired with the corresponding decomposition:
\a %2%< N0 thief > GOAL%< V2 steal > 1O 3%
\d %2%suwa-man%suwa-ma-n%
(Note that suwa-man corresponds to < N0 thief > GOAL, and suwa-ma-n to
< V2 steal > 1O 3.) For each analysis, there is a decomposition, so
when you choose a particular analysis, CED automatically chooses the
corresponding decomposition. This greatly simplifies the task of
editing ambiguities.
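Because the nth decomposition always corresponds to the nth analysis,
choosing one alternative settles both fields at once. CED does this for
you interactively; the Python sketch below is an illustration of the
pairing only, not code from CED:

# Illustration: picking alternative n from paired \a and \d fields.
def choose(a_field, d_field, n, marker="%"):
    a_alts = a_field.strip(marker).split(marker)[1:]   # drop the count
    d_alts = d_field.strip(marker).split(marker)[1:]
    return a_alts[n], d_alts[n]

a = "%2%< N0 thief > GOAL%< V2 steal > 1O 3%"
d = "%2%suwa-man%suwa-ma-n%"
choose(a, d, 0)   # ('< N0 thief > GOAL', 'suwa-man')
choose(a, d, 1)   # ('< V2 steal > 1O 3', 'suwa-ma-n')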
9.4 The editing process
CED splits the screen into two windows. Text is displayed in the upper
window, with a failure or ambiguity highlighted. Among the
alternatives of an ambiguity, the current alternative is given special
highlighting to distinguish it from the others. The flagging (%) does
not appear in the display of the site being edited. The lower window
contains the item to be edited, either a failure or the analysis
selected from the alternatives of an ambiguity. Prompts and helps are
also displayed in the bottom window.
To edit an ambiguity, you select, delete, or modify the current
alternative (the one that is highlighted). To select the current
alternative, press the ENTER key (which may be labeled RETURN instead
of ENTER), whereupon the other alternatives are discarded and the
selected analysis appears in the lower part of the screen. The cursor
appears after the last character. You may now modify the word, using
the word-edit commands.
When you are finished editing the word, press the ENTER key. CED then
asks "Is this what you want?" You may approve it by pressing the ENTER
key again. If, on the other hand, you wish to go back and make more
changes, type n and then press the ENTER key. At this point all of the
commands are available. For instance, if you would like to restore
this edit site to its original form (with all the original
alternatives) you may undo all modifications by typing u.
Whenever only one alternative remains (whether this has been brought
about by a selection or a series of deletions) the remaining
alternative is displayed on the lower portion of the screen for
editing and verification. Because failures have only one alternative,
whenever CED encounters one, it is automatically displayed in the
lower portion of the screen, whereupon you may modify it. There are
two cases in which you could be finished at an edit site:
(1) You may wish to leave things as they are, to be corrected later;
you indicate this by typing c (continue).
If the cursor is in the lower window, you must first press the ENTER
key. When CED asks "Is this what you want?", type n and then press
the ENTER key. Then you may give the c (continue) command to have CED leave
this edit as it is.
(2) You may be satisfied with the word as edited (of course you don't
have to change anything) so you press the ENTER key twice, once to
stop editing and once to verify that you are satisfied. In both cases
the text is then updated to reflect any changes you have made. CED
then moves on to the next site. CED removes the markers at an edit
site whenever you (by various manipulations) arrive at the word you
want and subsequently verify it. If you defer a decision concerning
how a site should be modified, the markers are not removed so that you
can edit these sites again with CED.
If you are unable to finish editing a text, you can direct CED to pass
the remainder of the input unchanged to the output file by typing q
(quit). (If the cursor is in the lower window, you must first press
the ENTER key and then respond with n to the query "Is this what you
want?" to get the full list of command options.) This does not undo
any edits you have made previously. Subsequently, you may continue
from where you left off by again editing the modified text with CED.
In this case, the name of your input file probably ends with CED, and
CED will suggest exactly the same name for the output. If you accept
this (making the name of the output and input files identical) CED
will complain and ask for another output file. So do one of two
things: (1) rename the input file to something like xxxxxx.tem before
you start CED, or (2) when CED asks for the name of the output file
(suggesting xxxxxx.ced) type a different name.
9.5 Command summary
CED has two levels of command, major commands and word-edit commands.
The major commands involve actions at the level of an entire edit site
or of the file, whereas word-edit commands involve modifications to
a particular word, carried out in the lower window. We now describe the
commands available at these two levels.
9.5.1 Major commands
The major commands are single letters. CED does not wait for the ENTER key
to be pressed before processing a command; indeed, the ENTER key is a
specific command. The commands are as follows:
(1) c (continue) leaves this set of alternatives as they are and goes
on to the next edit site.
(2) d deletes the current alternative.
(3) e edits (i.e., allows modification to) the current alternative;
the word-edit commands listed below (in section 9.5.2) become
available.
(4) q quits, that is, terminates this edit session. All modifications
previously made are retained in the output file. All subsequent
editing sites are passed to the output unmodified (to be dealt with in
a later editing session).
(5) u undoes any modifications made at this site, that is, it
restores the edit site to the form it had in the input file.
(6) ? or h displays a help message describing each of these commands
in the bottom window. If the window is too small to display the entire
message, CED pauses after filling the window and waits for the ENTER
key to be pressed before displaying more of the help message.
(7) ENTER selects the current alternative, deleting all others and
putting the current alternative into the edit window. (This is the
single key labeled ENTER or RETURN, not the string E n t e r!) After
any modifications and your approval, this alternative is put into the
output text and the other alternatives are discarded.
(8) Space moves to the next alternative, making it the current
alternative. (This is the space bar, not the string S p a c e!) When
at the last alternative, a space makes the first alternative into the
current one. Any character which is not recognized as a command serves
the same function.
9.5.2 Word-edit commands
The word-edit commands are described in the following list. (CTRL/X
refers to the character generated by holding the CTRL key down while
simultaneously typing x.)
(1) <- (the left arrow key) and CTRL/B move the cursor one character
to the left. If the cursor is on the first character, it moves to the
end of the word.
(2) -> (the right arrow key) and CTRL/F move the cursor one character
to the right. If the cursor is at the end of the word, it moves to the
first character of the word.
(3) DELETE, BACKSPACE, and CTRL/H delete the character to the left of
the cursor.
(4) CTRL/U and CTRL/W delete the entire word being edited, allowing a
completely new word to be entered.
(5) CTRL/R restores the original word, undoing any editing changes
which you have made.
(6) ? displays a message in the bottom window describing each of
these word-edit commands. If the window is too small to display the
entire message, CED pauses after filling the window and waits for the
ENTER key to be pressed before displaying more of the message.
(7) ENTER puts the word as it now appears into the output text
(provided you subsequently verify that this is what you want).
(8) Any other character is inserted to the left of the cursor.
NOTES
1 The particular choice of field markers and the order of fields in a
record is due to the fact that KTEXT uses the same text-handling
routines as an existing program called AMPLE (Weber et al., 1988).
This has the advantage that KTEXT's output is compatible with that
program, but the disadvantage that the record structure is perhaps not
consistent with terminology already established for PC-KIMMO. It
should also be noted that the quasi-database design of KTEXT's output
is used by many other programs developed by the Summer Institute of
Linguistics.
2 Tagalog, also known now as Pilipino or Filipino, is a major
language of the Philippines.
3 IT (pronounced "eye-tee") is an interlinear text editor that
maintains the vertical alignment of the interlinear lines of text and
uses a lexicon to semi-automatically gloss the text. See Simons and
Versaw (1991) and Simons and Thomson (1988).
4 ITF was developed by the Academic Computing Department of the
Summer Institute of Linguistics. It runs under MS-DOS, UNIX, and the
Apple Macintosh.
5 TEX is a typesetting language developed by Donald Knuth (see
Knuth, 1986).
6 The plain text version of this documentation does not include
figure 4, since it is an image of typeset output.
7 This section is adapted from chapters 7, 8, and 9 of Weber et al.
1988.
8 The CED program is not available for Macintosh.
REFERENCES
Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for
morphological analysis. Occasional Publications in Academic
Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
Bloomfield, Leonard. 1917. Tagalog texts with grammatical
analysis. Urbana, IL: University of Illinois.
Kew, Jonathan and Stephen R. McConnel. 1991. Formatting
interlinear text. Occasional Publications in Academic Computing
No. 17. Dallas, TX: Summer Institute of Linguistics.
Knuth, Donald E. 1986. The TEXbook. Reading, MA: Addison-Wesley
Publishing Company.
Simons, Gary F., and John Thomson. 1988. How to use IT:
interlinear text processing on the Macintosh. Edmonds, WA:
Linguist's Software.
Simons, Gary F., and Larry Versaw. 1991. How to use IT: a guide to
interlinear text processing, 3rd ed. Dallas, TX: Summer
Institute of Linguistics.
Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988.
AMPLE: a tool for exploring morphology. Occasional Publications
in Academic Computing No. 12. Dallas, TX: Summer Institute of
Linguistics.